
    Affine Invariant Covariance Estimation for Heavy-Tailed Distributions

    In this work we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. We prove that the proposed estimator $\widehat{\mathbf{S}}$ admits an \textit{affine-invariant} bound of the form $(1-\varepsilon)\mathbf{S} \preccurlyeq \widehat{\mathbf{S}} \preccurlyeq (1+\varepsilon)\mathbf{S}$ in high probability, where $\mathbf{S}$ is the unknown covariance matrix, and $\preccurlyeq$ is the positive semidefinite order on symmetric matrices. The result only requires the existence of fourth-order moments, and allows for $\varepsilon = O(\sqrt{\kappa^4 d \log(d/\delta)/n})$, where $\kappa^4$ is a measure of kurtosis of the distribution, $d$ is the dimensionality of the space, $n$ is the sample size, and $1-\delta$ is the desired confidence level. More generally, we can allow for regularization with level $\lambda$, in which case $d$ is replaced by the number of degrees of freedom. Denoting by $\text{cond}(\mathbf{S})$ the condition number of $\mathbf{S}$, the computational cost of the novel estimator is $O(d^2 n + d^3 \log(\text{cond}(\mathbf{S})))$, which is comparable to the cost of the sample covariance estimator in the statistically interesting regime $n \ge d$. We consider applications of our estimator to eigenvalue estimation with relative error, and to ridge regression with heavy-tailed random design.
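
    The affine-invariant guarantee can be read spectrally: $(1-\varepsilon)\mathbf{S} \preccurlyeq \widehat{\mathbf{S}} \preccurlyeq (1+\varepsilon)\mathbf{S}$ holds exactly when every eigenvalue of the whitened matrix $\mathbf{S}^{-1/2}\widehat{\mathbf{S}}\mathbf{S}^{-1/2}$ lies in $[1-\varepsilon, 1+\varepsilon]$. A minimal NumPy sketch of that check follows; the plain sample covariance is used only as a stand-in, since the paper's actual estimator is not spelled out in the abstract.

    import numpy as np

    def whitened_spectrum(S_hat, S):
        """Eigenvalues of S^{-1/2} S_hat S^{-1/2}; the affine-invariant bound
        (1-eps) S <= S_hat <= (1+eps) S holds iff they all lie in [1-eps, 1+eps]."""
        w, U = np.linalg.eigh(S)
        S_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
        return np.linalg.eigvalsh(S_inv_sqrt @ S_hat @ S_inv_sqrt)

    rng = np.random.default_rng(0)
    d, n = 10, 5000
    A = rng.standard_normal((d, d))
    S = A @ A.T + np.eye(d)                    # ground-truth covariance (well conditioned)
    X = rng.multivariate_normal(np.zeros(d), S, size=n)
    S_hat = np.cov(X, rowvar=False)            # stand-in estimator: sample covariance
    lams = whitened_spectrum(S_hat, S)
    eps = max(1 - lams.min(), lams.max() - 1)  # smallest eps for which the bound holds
    print(f"affine-invariant error eps = {eps:.3f}")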

    FALKON: An Optimal Large Scale Kernel Method

    Kernel methods provide a principled way to perform nonlinear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this paper, we take a substantial step in scaling up kernel methods, proposing FALKON, a novel algorithm that can efficiently process millions of points. FALKON is derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially $O(n)$ memory and $O(n\sqrt{n})$ time. An extensive experimental analysis on large scale datasets shows that, even with a single machine, FALKON outperforms previous state-of-the-art solutions, which exploit parallel/distributed architectures. Comment: NIPS 201
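
    As a rough illustration of the ingredients named above (random subsampling of centers plus an iterative solver for the resulting linear system), here is a Nyström kernel ridge regression sketch solved with conjugate gradient. It deliberately omits FALKON's specific preconditioner and parameter choices, and the Gaussian kernel and SciPy's cg are assumed stand-ins, so it shows the overall structure rather than the actual algorithm.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def nystrom_krr_cg(X, y, M=100, lam=1e-3, sigma=1.0, maxiter=100):
        """Pick M random centers, then solve the Nystrom system
        (K_nm^T K_nm + n*lam*K_mm) alpha = K_nm^T y by conjugate gradient."""
        n = X.shape[0]
        centers = X[np.random.choice(n, size=min(M, n), replace=False)]
        K_nm = gaussian_kernel(X, centers, sigma)
        K_mm = gaussian_kernel(centers, centers, sigma)
        def matvec(a):                        # matrix-vector product, never forming the full system
            return K_nm.T @ (K_nm @ a) + n * lam * (K_mm @ a)
        A = LinearOperator((len(centers), len(centers)), matvec=matvec)
        alpha, _ = cg(A, K_nm.T @ y, maxiter=maxiter)
        return centers, alpha

    # usage: y_pred = gaussian_kernel(X_test, centers, sigma) @ alpha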

    Learning with SGD and Random Features

    Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini batches and random features. The latter can be seen as a form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as the number of features, iterations, step-size and mini-batch size, control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments.
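
    A minimal sketch of this kind of estimator, under assumed choices (random Fourier features for a Gaussian kernel, constant step size, squared loss): there is no explicit penalty, and the number of features, step size, mini-batch size and number of passes play the role of regularization parameters.

    import numpy as np

    def rff(X, W, b):
        """Random Fourier features approximating a Gaussian kernel."""
        return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

    def sgd_rff(X, y, n_features=200, sigma=1.0, step=0.5, batch=32, passes=1, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((X.shape[1], n_features)) / sigma
        b = rng.uniform(0, 2 * np.pi, n_features)
        theta = np.zeros(n_features)            # no explicit penalty: regularization is implicit
        n = X.shape[0]
        for _ in range(passes):
            for idx in np.array_split(rng.permutation(n), max(1, n // batch)):
                Z = rff(X[idx], W, b)
                grad = Z.T @ (Z @ theta - y[idx]) / len(idx)   # squared-loss gradient on the mini-batch
                theta -= step * grad
        return lambda Xt: rff(Xt, W, b) @ theta

    # usage: f = sgd_rff(X_train, y_train, passes=5); y_pred = f(X_test)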

    A Consistent Regularization Approach for Structured Prediction

    We propose and analyze a regularization approach for structured prediction problems. We characterize a large class of loss functions that allows structured outputs to be naturally embedded in a linear space. We exploit this fact to design learning algorithms using a surrogate loss approach and regularization techniques. We prove universal consistency and finite sample bounds characterizing the generalization properties of the proposed methods. Experimental results are provided to demonstrate the practical usefulness of the proposed approach. Comment: 39 pages, 2 Tables, 1 Figure
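
    One concrete instance consistent with the description above is a two-step procedure: regress on the training outputs with kernel ridge weights, then decode by minimizing a loss-weighted combination over a finite candidate set. The Gaussian kernel, the finite candidate set and the exact form of the decoding step are assumptions made for illustration, not necessarily the paper's construction.

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def fit_structured(X, Y, loss, candidates, lam=1e-2, sigma=1.0):
        """Surrogate regression + decoding: ridge weights alpha(x) over the
        training points, then predict the candidate minimizing the weighted loss."""
        n = X.shape[0]
        W = np.linalg.solve(gaussian_kernel(X, X, sigma) + n * lam * np.eye(n), np.eye(n))
        def predict(x):
            alpha = W @ gaussian_kernel(X, x[None, :], sigma).ravel()
            scores = [sum(a * loss(c, yi) for a, yi in zip(alpha, Y)) for c in candidates]
            return candidates[int(np.argmin(scores))]
        return predict

    # usage: predict = fit_structured(X_train, Y_train, loss=my_loss, candidates=Y_train)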

    On the Sample Complexity of Subspace Learning

    A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. Comment: Extended version of conference paper
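
    As a concrete special case of the general formulation, the sketch below estimates a k-dimensional subspace by PCA on the sample covariance and measures its distance to a reference subspace through the operator norm of the difference of orthogonal projections, one natural metric between subspaces; the specific choices here are illustrative assumptions.

    import numpy as np

    def pca_subspace(X, k):
        """Orthonormal basis of the top-k eigenspace of the sample covariance."""
        Xc = X - X.mean(axis=0)
        C = Xc.T @ Xc / X.shape[0]
        _, U = np.linalg.eigh(C)          # eigenvalues in ascending order
        return U[:, -k:]

    def projection_distance(U, V):
        """Operator norm of P_U - P_V for subspaces spanned by the columns of U, V."""
        return np.linalg.norm(U @ U.T - V @ V.T, 2)

    # usage: err = projection_distance(pca_subspace(X_sample, k), U_true)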

    Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

    We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with the sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely, the decay of the eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
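
    A minimal sketch of the object being analyzed, single-sample least-squares SGD run for several passes over the same data; the constant step size and the reshuffled sweeps below are illustrative assumptions rather than the exact schedule studied in the paper.

    import numpy as np

    def multipass_sgd(X, y, passes=5, step=0.01, seed=0):
        """Least-squares SGD with multiple passes (reshuffled sweeps) over the data."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(passes):
            for i in rng.permutation(n):
                w -= step * (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x_i^T w - y_i)^2
        return w

    # Comparing passes=1 against passes>1 on held-out data is where the
    # multiple-pass advantage described above would show up on "hard" problems.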

    Exponential convergence of testing error for stochastic gradient methods

    We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed.
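
    The gap between the two error measures can be made concrete with a toy check: fit a squared-loss predictor by SGD on labels in {-1, +1}, then report both the test squared loss and the test classification error of sign(f(x)); the linear features below are an assumed placeholder for the kernel setting of the paper.

    import numpy as np

    def sgd_square_loss(X, y, steps=10_000, step=0.05, seed=0):
        """SGD on the squared loss (x_i^T w - y_i)^2 with labels y in {-1, +1}."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            i = rng.integers(len(y))
            w -= step * (X[i] @ w - y[i]) * X[i]
        return w

    def losses(w, X, y):
        f = X @ w
        return np.mean((f - y) ** 2), np.mean(np.sign(f) != y)   # squared loss, 0-1 error

    # usage: test_square_loss, test_error = losses(sgd_square_loss(Xtr, ytr), Xte, yte)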